LSAP: A Machine Learning Method for Leaf-Senescence-Associated Genes Prediction

Plant leaves, which convert light energy into chemical energy, serve as a major food source on Earth. Premature leaf senescence decreases crop yield and quality, so detecting senescence-associated genes is important. In this study, we collected 5853 genes from a leaf senescence database and developed a leaf-senescence-associated genes (SAGs) prediction model using the support vector machine (SVM) and XGBoost algorithms. This is the first computational approach for predicting SAGs with the sequence dataset. The SVM-PCA-Kmer-PC-PseAAC model achieved the best performance (F1score = 0.866, accuracy = 0.862 and area under the receiver operating characteristic curve = 0.922), and based on this model, we developed a SAGs prediction tool called “SAGs_Anno”. We identified a total of 1,398,277 SAGs from 3,165,746 gene sequences from 83 species, including 12 lower plants and 71 higher plants. Interestingly, leafy species showed a higher percentage of SAGs, while leafless species showed a lower percentage of SAGs. Finally, we constructed the Leaf SAGs Annotation Platform using these available datasets and the SAGs_Anno tool, which helps users to easily predict, download, and search for plant leaf SAGs of all species. Our study will provide rich resources for research on plant leaf-senescence-associated genes.


Introduction
Plant leaves, which convert light energy into chemical energy, are the main organ for photosynthesis and serve as a major food source on Earth [1]. There has been an increasing concern regarding the decrease in crop yield caused by premature senescence [2]. Many advances in the understanding of the molecular mechanisms of leaf senescence have been achieved, revealing that a large number of senescence-associated genes (SAGs) regulate leaf senescence [1,2]. A leaf senescence database (LSD: https://ngdc.cncb.ac.cn/lsd/, accessed on 1 May 2022) was constructed in 2010 to facilitate systematic studies of leaf senescence. The LSD 3.0 database, presented in 2020, integrates a comprehensive collection of 5853 genes and 617 mutants from 68 species, which provides scientists with useful resources for studies of leaf senescence [3].
Currently, senescence-associated genes are found mainly through biological experiments, which are complex, costly, and labor- and time-intensive. Computational and mathematical methods, such as intelligent data mining and knowledge discovery, are among the most promising alternatives. Machine learning (ML), as a part of artificial intelligence, "learns" a model from empirical data using statistical, probabilistic, and optimization methods in order to predict future data [4]. The support vector machine (SVM), one of many ML methods, is a supervised machine learning technique for classification tasks [4]. The XGBoost algorithm, an ensemble learning method, has a strong generalization ability and can therefore yield better models. ML has been successfully applied to many bioinformatics problems. For example, Bari et al. built an SVM model to predict a new class of cancer-related genes that were neither differentially expressed nor mutated [5].
Identification of leaf-senescence-associated genes through wet-lab experiments requires considerable time, human resources, and financial resources. No computational method based on SAGs protein sequence data is available, which motivated us to propose the present computational method to identify the proteins encoded by the leaf-senescence-associated genes. In this study, we present the Senescence-Associated Genes Annotation Tool (SAGs_Anno), a machine learning method to predict senescence-associated genes from protein sequences. The "SAGs_Anno" tool was developed based on the SVM-PCA-Kmer-PC-PseAAC model (F1score = 0.866, ACC = 0.862 and AUC = 0.922). To facilitate the scientific use of "SAGs_Anno", we developed the Leaf SAGs Annotation Platform (LSAP: http://www.sagsanno.top:8080/LSAP/index.jsp, accessed on 5 June 2022), based on the "SAGs_Anno" tool. We believe that the LSAP database can be a useful platform for the leaf senescence research community.

Collection of Datasets and Preprocessing
The LSD 3.0 database, presented in 2020, integrates a comprehensive collection of 5853 genes and 617 mutants from 68 species, which provides scientists with useful resources for systematic studies of leaf senescence [3]. The positive data were downloaded from LSD 3.0 (https://ngdc.cncb.ac.cn/lsd/, accessed on 5 May 2022) and further compared to the Pfam (http://pfam.xfam.org/, accessed on 5 May 2022) database [6] using Perl scripts. The positive data included 1638 gene families. Negative data, including 16,291 gene families, were downloaded from the Pfam database using Python scripts. To clean the data, we removed the records that contained residues B, J, O, U, X and Z. Additionally, we removed sequences that contained fewer than 50 amino acids. We also removed the redundant sequences using the CD-HIT program [7] with a threshold of 0.7. Eventually, the filtered dataset contained 6377 and 15,278 protein sequences, which were used to build the classification model.
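The two sequence-level cleaning rules described above (dropping records with residues B, J, O, U, X or Z, and dropping sequences shorter than 50 amino acids) can be illustrated with a minimal Python sketch. This is not the authors' script: the function names `read_fasta` and `clean_records` and the toy records are hypothetical, and the CD-HIT redundancy step is not reproduced here.

```python
from typing import Dict, Iterable, Iterator, Tuple

AMBIGUOUS = set("BJOUXZ")  # residue codes excluded in the text
MIN_LENGTH = 50            # sequences shorter than this are dropped


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep sequences that are long enough and free of excluded residues."""
    kept = {}
    for header, seq in records:
        if len(seq) < MIN_LENGTH:
            continue  # too short
        if AMBIGUOUS & set(seq):
            continue  # contains B, J, O, U, X or Z
        kept[header] = seq
    return kept


# Toy input (hypothetical records, not taken from LSD 3.0 or Pfam).
example = [
    ">ok_protein", "ACDEFGHIKLMNPQRSTVWY" * 3,
    ">has_unknown_residue", "ACDEFGHIKLX" * 6,
    ">too_short", "ACDEFGHIKL",
]
cleaned = clean_records(read_fasta(example))
print(sorted(cleaned))
```

Only the first toy record survives both filters; in practice the lines would come from an opened FASTA file rather than a list.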

Features Selection
In this study, three kinds of features, including Kmer, parallel correlation pseudo amino acid composition (PC-PseAAC), and auto-cross covariance (ACC), were employed to construct the SAGs_Anno predictor. The Pse-in-One 2.0 software [8], available from the Pse-in-One 2.0 web server (http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/, accessed on 26 May 2022), was used to extract features. The nac.py script with k = 2 was used to extract Kmer features. The pse.py script, using the parameters lambda = 2 and w = 0.05, was used to extract PC-PseAAC features. Additionally, the ACC features were extracted using the acc.py script with LAG = 3.
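To make the Kmer feature concrete, the following sketch computes a normalized 2-mer composition vector for a protein sequence. It is an illustration only and does not call Pse-in-One: the function name `kmer_features` is hypothetical, and the normalization (dividing each count by the total number of valid k-mers) is an assumption that may differ from what nac.py does.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_features(sequence: str, k: int = 2) -> List[float]:
    """Return the frequency of every possible k-mer, in a fixed order.

    For k = 2 the vector has 20 ** 2 = 400 entries.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


vector = kmer_features("ACACAC", k=2)
print(len(vector))  # 400 dimensions for k = 2
```

For the toy sequence "ACACAC" the windows are AC, CA, AC, CA, AC, so only two of the 400 entries are non-zero.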

Machine Learning Model Development
The machine learning prediction model contains many parameters. We needed to determine the optimal values of the parameters through training optimization. We used two machine learning algorithms, namely SVM, provided by the auto-sklearn v0.12.7 package, and XGBoost, provided by the xgboost v1.5.2 package.

Performance Evaluation
In this study, we used fivefold cross-validation to evaluate the performance of our model. We used the F1score, ACC, and AUC as indicators to systematically evaluate the performance of the models from different aspects. We used the pROC v1.16 [...] where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
[...] [25] (https://banana-genome-hub.southgreen.fr/, accessed on 5 May 2022), respectively. To clean the data, we removed the records that contained unknown amino acids using Python code. The Pse-in-One 2.0 [8] tools were used to extract Kmer and PC-PseAAC features. Based on our SVM-PCA-Kmer-PC-PseAAC model, we predicted plant SAGs on a large scale from 83 plants.
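For reference, the F1score and ACC indicators used for performance evaluation can be computed from TP, TN, FP and FN as in the sketch below. It uses the standard textbook definitions, which are assumed here to match those used in this study; the function names and the example counts are hypothetical.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall).

    This is algebraically equal to 2 * TP / (2 * TP + FP + FN).
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 80, 90, 10, 20
acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(f"ACC = {acc:.3f}, F1score = {f1:.3f}")
```

AUC is not covered by this sketch because it depends on the ranking of continuous prediction scores rather than on a single confusion matrix.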

Database Construction
The LSAP (http://www.sagsanno.top:8080/LSAP/index.jsp, accessed on 5 June 2022) database was created by integrating a variety of bioinformatics programs on the Linux platform. The system is hosted on an Aliyun server and uses Apache Tomcat as the web server. The collected data were processed using Python code. All datasets were integrated into a MySQL database. Java, HTML5, JavaServer Pages, CSS3, and jQuery were used to transmit query requests and extract plant SAGs data from the MySQL database for display in the report pages.

The Results of SVM Performances
A support vector machine, widely used in bioinformatics and computational biology, is a supervised machine learning technique for classification tasks [4]. The filtered dataset contained 6377 and 15,278 protein sequences, and 80% of the dataset was used to build the classification model. In this study, seven kinds of features, including Kmer, PC-PseAAC, AAC, Kmer-PC-PseAAC, Kmer-AAC, PC-PseAAC-AAC, and Kmer-PC-PseAAC-AAC, were employed to build the SAGs_Anno prediction tool using the SVM method. For SVM, we tuned three hyperparameters (cost, gamma, and kernel) and optimized them using a grid search. Then, the remaining 20% of the filtered dataset was used to evaluate the prediction model. Tables 1 and S1 show the performance of the best prediction model, from which we can see that SVM-Kmer-PC-PseAAC achieved the best performance (F1score = 0.851, ACC = 0.854 and AUC = 0.925), followed by the SVM-PC-PseAAC model (F1score = 0.838, ACC = 0.833 and AUC = 0.900).
Principal component analysis (PCA) is a very effective method for data dimension reduction and feature extraction. To further improve the SAGs prediction model, we selected four kinds of combined features, including Kmer-PC-PseAAC, Kmer-AAC, PC-PseAAC-AAC, and Kmer-PC-PseAAC-AAC, and used the PCA method to calculate the discriminative weight vectors in the feature space. We trained the prediction model on combined features reduced to different numbers of dimensions. Figure S1 shows the performance of the best predictive model. The results show that for the four kinds of combined features, Kmer-PC-PseAAC, Kmer-AAC, PC-PseAAC-AAC, and Kmer-PC-PseAAC-AAC, the most discriminative numbers of dimensions are 410, 401, 46 and 161, respectively. Considering task complexity and runtime, we only used these most discriminative dimensions to further improve the SAGs prediction model. We trained four predictive models using the different combinations of feature sets.
We tuned three hyperparameters (cost, gamma, and kernel) and optimized them using a grid search. The specific features for each combination and number of features, as well as the F1score, ACC, and AUC scores, are shown in Tables 2 and S1, from which we can see that the SVM-PCA-Kmer-PC-PseAAC model achieved the best performance (F1score = 0.866, ACC = 0.862 and AUC = 0.922). This improves on the SVM-Kmer-PC-PseAAC model in F1score and ACC (0.851 and 0.854, respectively), although its AUC is slightly lower (0.922 versus 0.925).
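The workflow described in this section (an 80/20 split, PCA dimension reduction, an SVM, and a grid search over cost, gamma and kernel with fivefold cross-validation) can be sketched as follows. This is an illustrative sketch, not the authors' code: it uses plain scikit-learn rather than the auto-sklearn package named in the Methods, the feature matrix is random synthetic data standing in for the Kmer-PC-PseAAC features, and the grid values and numbers of PCA components are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80% of the data for model building, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# PCA followed by an SVM; both steps are tuned together.
pipeline = Pipeline([("pca", PCA()), ("svm", SVC())])
param_grid = {
    "pca__n_components": [5, 15],    # candidate reduced dimensions
    "svm__C": [1, 10],               # "cost"
    "svm__gamma": [0.01, 0.1],
    "svm__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 20%.
y_pred = search.predict(X_test)
y_score = search.decision_function(X_test)
test_f1 = f1_score(y_test, y_pred)
test_acc = accuracy_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_score)
print(search.best_params_)
print(f"F1score = {test_f1:.3f}, ACC = {test_acc:.3f}, AUC = {test_auc:.3f}")
```

Placing PCA inside the pipeline means the projection is refitted on each training fold, so the cross-validation scores are not influenced by the validation fold.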

The Results of XGBoost Performances
The XGBoost algorithm is an ensemble learning method that is widely used in the bioinformatics field. In this study, we also used the XGBoost algorithm to train the classification model. For seven kinds of features, including Kmer, PC-PseAAC, AAC, Kmer-PC-PseAAC, Kmer-AAC, PC-PseAAC-AAC, and Kmer-PC-PseAAC-AAC, we trained seven predictive models using the XGBoost algorithm. We tuned six hyperparameters (max_depth, subsample, min_child_weight, colsample_bytree, gamma, and learning_rate) and optimized them using a grid search. The performances of the seven predictive models are shown in Tables 3 and S2. The XGBoost-Kmer-PC-PseAAC-AAC model achieved the best performance (F1score = 0.865, ACC = 0.860 and AUC = 0.925). For the four kinds of combined features, we also used the PCA method to calculate the discriminative weight vectors in the feature space, and we trained the prediction model on combined features reduced to different numbers of dimensions. The results show that for the four kinds of combined features, Kmer-PC-PseAAC, Kmer-AAC, PC-PseAAC-AAC, and Kmer-PC-PseAAC-AAC, the most discriminative numbers of dimensions are 212, 411, 46 and 425, respectively (Figure S2). After the dimension was reduced using the PCA method, the prediction model performance did not improve (Tables 4 and S2).

A Plant SAGs Prediction Tool for Users
We built 22 machine learning models based on two types of learning algorithms: SVM and XGBoost (Tables 1-4). The SVM-PCA-Kmer-PC-PseAAC model achieved the best performance, followed by the XGBoost-Kmer-PC-PseAAC-AAC model. Based on the SVM-PCA-Kmer-PC-PseAAC computational model, we developed a tool called "SAGs_Anno" (http://www.sagsanno.top:8080/LSAP/DownloadDetail_detail.action?download_fileType=SAGs_Anno, accessed on 5 June 2022) for proteome-wide identification of proteins encoded by the plant leaf-senescence-associated genes. We also provide instructions on how to use this tool. The tool consists of three main scripts: New_data_dealing.py, Pre_SAGs.py, and Pre_result_id.py. Using the New_data_dealing.py script, users can remove sequences with residues B, J, O, U, X and Z. After removing such sequences, users can extract Kmer and PC-PseAAC features using the Pse-in-One 2.0 tools. With the Pre_SAGs.py script, users can predict plant SAGs based on the SVM-PCA-Kmer-PC-PseAAC computational model. Then, users can extract the SAGs IDs using the Pre_result_id.py script. In summary, the developed prediction tool will be of great help to researchers working on the identification of plant leaf-senescence-associated genes via wet-lab experiments.

Large-Scale Prediction of SAGs
We collected the protein sequences dataset of 83 examined species, which contained 12 lower plants and 71 higher plants, from a public database. The higher plants were further divided into 49 eudicots, 18 monocots, and 4 other higher plants. We identified a total of 1,398,277 SAGs from 3,165,746 gene sequences of 83 species (Table S3).
About half of the species were horticultural plants (Table S3). The average number of SAGs per species was 16,846.71, and most species (79, 95.18%) had more than 1000 SAGs (Table S3). The average SAGs percentage was 41.92%, and only five species (6.02%) had a SAGs percentage below 25%: Chlorella variabilis, Cyanidioschyzon merolae, Chlamydomonas reinhardtii, Dunaliella salina, and Coccomyxa subellipsoidea, all of which are lower plants.
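The summary figures quoted above follow from simple arithmetic on the reported totals, as the short sketch below shows. All input numbers are taken from the text; the variable names are illustrative. The last quantity, a pooled fraction over all sequences, is computed only for comparison and is not the same statistic as the reported 41.92%, which is an average of per-species percentages.

```python
total_sags = 1_398_277     # predicted SAGs across all species
total_genes = 3_165_746    # gene sequences examined
n_species = 83

# Average number of SAGs per species.
mean_sags_per_species = total_sags / n_species

# Share of species with more than 1000 SAGs (79 of 83).
share_over_1000 = 79 / n_species * 100

# Share of species with a SAGs percentage below 25% (5 of 83).
share_below_25 = 5 / n_species * 100

# Pooled fraction of all sequences predicted as SAGs. This weights
# species by proteome size, unlike an average of per-species percentages.
pooled_percentage = total_sags / total_genes * 100

print(f"{mean_sags_per_species:.2f}")  # 16846.71
print(f"{share_over_1000:.2f}%")       # 95.18%
print(f"{share_below_25:.2f}%")        # 6.02%
print(f"{pooled_percentage:.2f}%")
```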

Comparative Analysis of SAGs in Plants
More SAGs were detected in higher plants than in lower plants. The average SAGs number in higher plants (19,343.97) was 9.5 times that in lower plants (2036.08), which may be due to the whole-genome duplication and whole-genome triplication events that occurred in most higher plants. The 10 species with the highest percentage of SAGs were all higher plants and, interestingly, all eudicots. This suggests that eudicot plants might contain a higher proportion of SAGs than monocots and other higher plants. Interestingly, leafy species showed a higher percentage of SAGs and leafless species showed a lower percentage of SAGs. This phenomenon suggests that genes and plant phenotypes have the same evolutionary trend.

Plant Leaf SAGs Database Construction
Using these available datasets, we constructed the Leaf SAGs Annotation Platform (LSAP: http://www.sagsanno.top:8080/LSAP/index.jsp, accessed on 5 June 2022), which helps users to easily predict, download, and search for plant leaf SAGs of all species. The LSAP structure has a user-friendly interface and consists of seven main modules, including Home, Specie, Download, SAGs_Anno, Userguide, Submit and Links (Figure 1).

SAGs_Anno
Based on the "SAGs_Anno" tool that we developed, we provide an online service for predicting plant leaf SAGs, built using Java, HTML5, and JavaScript. Users only need to supply amino acid sequences in FASTA format and submit the task. The prediction results can then be browsed and downloaded from the results interface (Figure 2).

Browse Examined Species SAGs Dataset
Here, we identified a total of 1,398,277 plant leaf SAGs from 3,165,746 gene sequences of 83 species. The complete SAGs dataset was integrated into the species module. We provided detailed information for each species, including gene identification, coding sequences, protein sequences, and the total number. Scientists can browse and access detailed information about the desired SAGs dataset by clicking the species name.

Download
The Download module has two divisions: the SAGs_Anno division and the SAGs dataset division. The prediction tool "SAGs_Anno", together with instructions on how to use it, can be obtained from the SAGs_Anno division. This division also provides the SVM-PCA-Kmer-PC-PseAAC machine learning models and the positive and negative datasets used for training. The SAGs dataset division provides a total of 1,398,277 plant leaf SAGs from 83 species, which comprise 12 lower plants and 71 higher plants.

Userguide, Submit, Home, and Links
To help users to access the LSAP database, we provide instructions on how to use this platform. The "Contact Us" function provided at the bottom of every interface contains an e-mail address and phone number to allow users to contact us conveniently and quickly. In the future, we will continue to identify plant leaf SAGs from protein datasets of sequenced species and add them to our LSAP database. To encourage users to submit a new plant leaf SAGs dataset to us, a "Submit" function was embedded in the LSAP. We welcome suggestions from scientists all over the world to further improve our database. We believe that our database will be useful to all researchers.

Discussion
In this study, we presented a novel computational approach to the recognition of proteins encoded by plant leaf-senescence-associated genes. Compared with biological experiments, this method has the advantages of fast, easy, and inexpensive identification of SAGs. The experimental results showed that our method has a good performance (F1score = 0.866, ACC = 0.862 and AUC = 0.922). The BLAST program [26] has a low recognition rate for non-homologous sequences. Compared with the BLAST program, our method has the advantages of high-efficiency and fast identification of SAGs. This is the first computational approach to predicting SAGs with the sequence dataset. Based on the SVM-PCA-Kmer-PC-PseAAC computational model, we presented a tool, "SAGs_Anno", for the proteome-wide identification of proteins encoded by the plant leaf-senescence-associated genes. We believe that this tool will be of great help to the plant SAGs scientific community. We also predicted SAGs on a large scale from protein datasets, which were collected from a public database, and a total of 1,398,277 SAGs were identified from 3,165,746 gene sequences of 83 species. Interestingly, leafy species showed a higher percentage of SAGs and leafless species showed a lower percentage of SAGs. This phenomenon suggests that genes and plant phenotypes have the same evolutionary trend.
Using these available datasets, we constructed the Leaf SAGs Annotation Platform (LSAP: http://www.sagsanno.top:8080/LSAP/index.jsp, accessed on 5 June 2022), which helps users to easily predict, download, and search plant leaf SAGs of all species. We believe that LSAP will be of great help to all researchers. The uncertainty of the negative dataset is the primary weakness of our method, and we will improve the performance of our method when the LSD database is updated. In the future, more effective features and deep learning techniques, such as convolutional neural networks, recurrent neural networks, and multilayer perceptrons, will be explored to improve our prediction model. In conclusion, this study will serve as a useful resource for future studies on plant leaf-senescence-associated genes.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/life12071095/s1, Figure S1: The most discriminative dimensional features number using SVM algorithm; Figure S2: The most discriminative dimensional features number using XGBoost algorithm; Table S1: The hyperparameters of SVM predictive model; Table S2: The hyperparameters of XGBoost predictive model; Table S3: The SAGs data of 83 examined species.
Author Contributions: Z.L., W.T., X.Y. and X.H. designed this study. Z.L., W.T. and X.Y. performed the experiments. Z.L. and X.H. prepared the article. All authors have read and agreed to the published version of the manuscript.